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Abstract 



Optimization of complex functions, such as the output of computer simulators, is 
a difficult task that has received much attention in the literature. A less studied prob- 
lem is that of optimization under unknown constraints, i.e., when the simulator must 
be invoked both to determine the typical real-valued response and to determine if a 
constraint has been violated, either for physical or policy reasons. We develop a statis- 
tical approach based on Gaussian processes and Bayesian learning to both approximate 
the unknown function and estimate the probability of meeting the constraints. A new 
integrated improvement criterion is proposed to recognize that responses from inputs 
that violate the constraint may still be informative about the function, and thus could 
potentially be useful in the optimization. The new criterion is illustrated on synthetic 
data, and on a motivating optimization problem from health care policy. 

Key words: constrained optimization, surrogate model, Gaussian process, sequential 
design, expected improvement 

1 Introduction 

A common optimization problem that arises in fields ranging from applied engineering to 
public policy is to find x* = argminajg;^ f{x)j subject to constraints: x* G C, where we may 
only learn about the relationship between x and f{x) : A" — )■ M and the constraint region C 
through expensive evaluations of the noisy joint process 



The real-valued noise variance, rj^, is unknown but may be zero, and Ec indicates that the 
constraint mapping may be random. In particular, the constraint region C G X is well- 
defined but often non-trivial. Although it will typically be deterministic {Ec = 0), this is not 
required by our treatment. Finally, we suppose that observing the joint response {Z, C){x) is 
expensive. So we wish to keep the number of evaluations, {xi, zi, ci), . . . {xn, zn, cat), small. 
One way to do this is to build regression and classification models JnIx) for f{x) and cn{x) 




(1) 



for c(x) based on the data. The surrogate surfaces may be searched to find x' yielding a small 
objective in expectation, and satisfying the constraint with high probability. We can then 
repeat the process with + 1 points, including (x', ^(x'), C(a:')), stopping when convergence 
in the location of x* is achieved, or when resources are exhausted. 

To shed light upon the difficulty in solving this problem, and to thereby suggest possible 
points of attack, consider the following simplification where the constraint region C is known 
at the outset (i.e., there is no need to estimate ctv)- In this case a sensible approach is as 
follows. Obtain realizations z{x) of ^{x) only for a; G C with the largest expected improve- 
ment (EI, Jones et all . 19981 ) under /^v [more on this in Section [2] and proceed to construct 



/at+i by adding the {x,z{x)) pair into the design. This presumes that evaluating f{x) for 
X G A" \ C is a waste of resources. But this need not be so, since Z{x), for any x, contains 
information about /, and therefore about promising location(s) for x* G C. It could even be 
that x' ^ C is best at reducing the overall uncertainty in the location of x* G C, through an 
improved new surrogate /n+i- When this is the case [e.g., see Section 133] it makes sense to 
sample Z(x') for x' ^ C despite the constraint violation. 

Assessing when this odd maneuver is advantageous requires a more global notion of im- 
provement; EI cannot directly quantify the extent to which x' ^ C improves our information 
at X G C. Finally, when C is not known a priori, new evaluations {x',z' = z{x')) provide 
information about both / and c through their surrogates /at and cn- Thus incremental 
decisions toward solving the constrained optimization problem must incorporate uncertainty 
from both surrogates. We propose a new integrated improvement statistic to fit the bill. 

The rest of the paper is outlined as follows. In Section |2] we outline EI for (unconstrained) 
optimization and the GP surrogate models upon which it is based. In Section [3] we develop 
the conditional and integrated expected improvement statistic (s) for the case of known con- 
straints, with an illustration. We extend the method to unknown constraints in Section 
m and demonstrate the resulting constrained optimization algorithm on synthetic data. In 
Section [5] we consider a motivating problem from health care policy research, and conclude 
with some discussion and extensions in Section [HI Software implementing our me thods, and 



the s pecific code for our illustrative examples, is available in the plgp package (IGramacy 



20101) for R on GRAN. 



2 Previous Work 
2.1 Surrogate Modeling 

The canonical choice of su r rogate model for computer experiments is th e stati onary Gaussian 



process (GP, I Sacks et al.l . Il989t lO'Hagan et all . Il999l : ISantner et al.l . l2003l ). which is one 



way of characterizing a zero mean random process where the covariance between points is 
explicitly specified through the function C(x, x') = a'^K{x,x'). Let Zn = (^i, . . . ,zn)'^ be 
the vector of observed observed responses at the design points Xi, . . . , xat collected (row- wise) 
in X]y. Gonditional on this data D]\r = {X^r, Z]y}, the (posterior) predictive distribution of 
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Z{x) at a new point x under the GP is normal with 



mean 
and variance 



a 



N 



[x) = a'^[K{x, x) 



(2) 



k'^{x)K^^kN{x)], 



where kjj{x) is the A^-vector whose i"* component is K{x,Xi), and Kj^j is the N x N matrix 
with i,j element K{xi,Xj). These are sometimes called the kriging equations. Joint pre- 
diction at a collection of points X is multivariate normal with mean vector zi\f{X) and co- 
variance matrix EAr(X) w hich are defined by the st raightforward matrix extension of k^iX) 
and K{X,X). We follow iGramacy and Led (120081 ) in specifying that ■) have the form 
K{x,x'\g) = K*{x,x') + ri5x,x'i where 5.,. is the Kronecker delta function, and K* is a true 
correlation function. The rj term, referred to as the nugget, is positive (77 > 0) and provides 
a mechanism for int roducing measur ement error into the stochastic process — implementing 



> in Eq. ([T]) (iGramacyl . l2005l . appendix). It causes the predictive equations ([2]) to 
smooth rather than interpolate the data [Xj^^Ziq). It is common to take •) from a 



parametric family, such as the separable Matern or power families (e.g.. lAbrahamsenl . 119971 ). 
which roughly model K*{-,-) as an inverse function of coordinate-wise Euclidean distance. 
We prefer the power family, which is standard for computer experiments. 



2.2 Optimization by Expected Improvement 



Conditional on a GP surrogate /jv, a step towards finding th e minimum may be based upon 
the expected improvement (EI) statistic f Jones et al.l . Il998l ). For a deterministic function 
{rj = 0), the current minimum /min = mm{zi, . . . , z^} is deterministic. In this case, the 
improvement is defined as I{x) = max{/inin — Z{x), 0}. The next location is chosen as 



x' = argmaxE{/(x)}, 



(3) 



where the expectation is taken ov e r Z(x ) ~ -Fjv(x), the predictive distribution implied 
by /at evaluated at x. Jones et al. f 19981 ) give an analytical expression for the EI: 



E{I{X)} = {Uin -Zn{x))^ 



min Z]\f[X) 
(TAr(x) 



/min - Zn{x) 

d-Nix) 



(4) 



Basically, the EI is the cumulative distribution of the predictive density that lies "un- 
derneath" f rnm- A rclevaut diagram illustrating EI appears in Figure [1] in Section 13.1.11 



Jones et al.l ( 119981 ) also provide a branch and bound algorithm for performing the maximiza- 
tion over X to find x'. Once x' is chosen it i s added into the d esign as {xn+i, zn+i) = 
{x', f{x')) and the procedure repeats with /n+i- I Jones et al.l (119981 ) use maximum likelihood 
inference to set the parameters for f^, i.e., d only since r] = 0, and call the resulting iterative 
procedure the efficient global optimization (EGO) algorithm. The above choice of /mm is 
sensible but somewhat arbitrary. Another reasonable choice that we promote in this paper 
is /min = TcimzNix), the minimum of the (posterior) mean predictive surface. 
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The situation is more complicate d for noisy responses. We must then estimate the nugget, 
rj, and extend the I Jones et all (119981 ) definition of /min to be a random variable: the first order 
statistic of Zi, . . . , Z^. Calculating the EI would thus require integrating over in Eq. ([3]). 
This breaks the analytical tractability of EGO algorithm, however one can always proceed 
by Monte Carlo methods. Once in the Monte Carlo framework, extensions abound. For 
example, it is trivial to take a Bayesian approach and thereby factor parameter uncertainty 
into the EI calculation. Conditional on the parameters however, choosing /min min zjv(a:) 
is still deterministic. So this choice allows an analytical approach to proceed when point- 
estimates (i.e., MLEs) of parameters are used, or it leads to a more efficient Monte Carlo 
algorithm when sampling from the Bayesian posterior. The downside of the Monte Carlo 
approach, whether taken for Bayesian or /min considerations, is that the branch and bound 
algorithm for determining x' in Eq. ([3]) is no longer available. However, proceeding with a 
discrete set of space- filling candidates, an d leveraging direct optimization methods in tandem, 
has proved fruitful (ITaddy et all . l2009bl ). 



2.3 Towards Constrained Optimization 

Ours in no t the first atternpt at t ackling the constrained optimization problem via surrogate 
modeling. ISchonlau et al.l (119981 ) consider deterministic responses (77 = 0) where the known 
constraint region can be written as < Ck{x) < bk, for k = 1, . . . , K. They then treat the 
Cfc(x) as additional response variables that co-vary with /(x). This breaks the analytical 
tractability of the EI calculation. Upon assuming that the K + 1 responses are independent 
the calculation is again tractable, otherwise a Monte Carlo approach is needed. We are not 
aware of any previous literature addressing our more general problem: where the function 
/ may not be deterministic, and when there are unknown constraints of arbitrary form. 
Even in simpler settings, like the one above, it may be advantageous to sample outside the 
constraint region. This requires a new improvement statistic — one that weighs the overall 
expected improvement of the next sequentially chosen design point in aggregate. 



3 Integrated Expected Conditional Improvement 

Here we generalize the EI framework to accommodate the drawbacks outlined above. To start 
with, we assume that constraints are deterministic, and known (with trivial computation) 
in advance. Section |4] provides extensions for unknown constraints. 
Define the conditional improvement as 

I{y\x) = max{/min - Z{y\x), 0}, (5) 

where Z{y\x) ~ -F/v(?/|x), which is the predictive distribution of the response Z{y) at a 
reference input location y under the surrogate model /tv given that the candidate location 
X is added into the design. We do not use an -|- 1 subscript for the posterior predictive 
distribution because the realization of the response z{x) is not yet available. 
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The expected conditional improvement (ECI) at the reference point y is then E,{I{y\x)}. 
Here the expectation is over all of the random quantities: the distribution of Z{y\x), and 
perhaps of /mm depending upon how it is defined. The ECI may be evaluated at all pairs 
of inputs (x, y) G X. The potential to generalize EI, which accounts for improvement at the 
point X alone, comes by integrating over the choices for y. Let g{y) denote a density over 
y E X which may be uniform in a bounded region. Then the integrated expected conditional 
improvement (lECI) is defined as 

E,{I{x)} = - [ E{Iiy\x)}giy)dy. (6) 

This suggests using x' = argmax^jg;!:' Kg{I{x)} as the next adaptively sampled point. As long 
as K{I{y\x)} < K{I{y)} for all x E X, this statistic is defensible. Defining /mm carefully 
[see Section 13.1.1] ensures that this monotonicity condition holds. 

The negation in Eq. ([6]) keeps lECI in line with the convention of maximizing, i.e., 
of preferring large EI statistics over small ones. To explain, consider how I{y\x) "looks 
ahead" . We wish to measure an improvement at x, but in a roundabout way we assess that 
improvement at a reference point y instead, supposing x has been added into the design. If 
y still has high improvement potential after x has been added in, then x must not have had 
much influence on the improvement at y. If x is influential at y, then the improvement at y 
should be small after x is added in, not large. 

We can alternatively define lECI as the expected reduction in improvement at the refer- 
ence location, y, when x is added into the design: 

= / (E{J(y)} - E{Iiy\x)})g{y) dy, (7) 
Jx 

which is guaranteed to be positive under our monotonicity assumption. We would then take 
the x' which gave the largest reduction. But clearly this is within an additive constant (the 
weighted-average EI over g{y)) of the definition given in Eq. and is thus equivalent. 

The integrated approach allows constraints to be handled through g{y). E.g., g{y) can 
be uniform for y E C and zero otherwise. Or, [as we discuss in Section |4] it can give higher 
weight to y with a greater chance of satisfying the constraint. When there are no constraints, 
choosing g{y) uniform on y E X yields an aggregated statistic that will offer a more global 
search, compared to EI, in a manner similar to how the expected reduc tion in variance 



generalizes the predictiv e variance for sequential design by active learning (jSeo et al.l . 12000 



Gramacy and Led . l2009l ) 



3.1 Expected Conditional Improvement 

The key ingredient in calculating the ECI is an assumption about how Z{y\x) behaves relative 
to Z{y). Let F/v(?/|x) denote the distribution of Z{y\x). Overloading the notation somewhat, 
let fN{z{x)) denote the density of Z{x) under F^, and likewise fN{z{y)\x) for Z{y\x). By 
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the law of total probability, 



fN{z{y)\x) = j fN{z{y),z{x)\x) dz{x) 

= / fN+i{z{y)\x,z{x))fN{z{x)) dz{x) 
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where fN+i{z{y)\x, z{x)) is the predictive density of Z{y) when the design matrix and re- 
sponse vector are augmented by {x,z{x)). Note that the above expressions involving z{y) 
have an implicit conditioning upon y. For an arbitrary surrogate, computing the integral in 
Eq. dH]) analytically would present a serious challenge. However, under a GP surrogate it is 
trivial since Fn and F^+i are both (univariate) normal distributions (|2]), and a convolution 
of normals is also normal. Trivially, the mean and variance of the (normal) predictive density 
fN+i{z{y)\x, z{x)) is unchanged after integrating out Z{x) since the GP is not dynamic, so 
there is no update from /^r without observing z{x]\f+i). 

But at the same time, the predictive variance ([2]) does not depend upon the responses, 
Z]\r or z{x) via Z]\r^i. So we can deduce what the variance of the predictive density 
fN+i{z{y)\x, z{x)) will be once z{x) arrives. We will have aj^_^_^{y\x, z{x)) = ^^]^_^_l{y\x) under 
the assumption that the evidence in z{x) does not update/change parameters of the GP 
(which it can't if it is not observed!). Now, alf_^_^{y\x, z{x)) depends upon K]^^^^{x) whose 

_l_ ]^st column are populated with K{xi,x) for i = 1,. . . ,N and with K{x,x) 

appearing in the bottom right-hand corner. So i^^^^(x) can then be obtained in terms of 
K]^^ via partitioned inverse equations. If 



where g{x) = —fj,{x)K]^^kiy{x) and fi^^{x) = K{x,x) — kJj{x)Kjj^ki^{x). This saves us 
from performing any additional 0{N^) matrix operations. So a%j^i{y\x) = a^[K{y,y) — 
kJl_^_l{x] y) K]^^^^{x)kN+i{x; y)] where kjf^i{x; y) is an (A^-l-l)-vector whose first N entries are 
identical to k^ly) and with an A^+l*^* entry of K{y, x). The amount by which a%_^_i{y\x, z{x)) 
is reduced compared to o"^(y) is then readily available. Let G{x) = g{x)g^{x). Then, 



So we can see that the deduced predictive variance at y will be reduced when z{x) is observed 
by an amount that depends upon how far apart y and x are. This is not only sensible, but 
will also be helpful for determining the influence of x in improvement calculations. 

To sum up, we propose to define F/v(?/|x), for the purposes of sequential design, to be 
a normal distribution with (true) mean ZN{y\x) = ZN^y) and deduced variance ajf{y\x) = 
^N+iiy\^^ ^i^)) = ^N+iiy\^) given in Eq. ([9]), above. As with the kriging equations ([2]), 



Kn+i{x) 



Kn k^ix) 
k%{x) K{x,x) 



then 




[Kn + 9{x)g'^{x)fi ^(x)] g{x) 

g^{x) fi{x) ' 



+ 2kl{y)g{x)K{x, y) + K{x, yf^i]. 



(9) 
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joint sampling for a collection of (M) reference inputs is possible via the appropriate 
matrix extensions to kN{YM) and K(Ym,Ym) in order to derive Z]\f(YM\x) and T,n(Ym\x). 

Now, with an appropriate definition of a deterministic /mim the same analytic expression 
for the EI from Section [2] can be extended to the ECI: 



E{I{y\x)} = 

(/min - 2Ar(l/|x))$ 



(10) 



fmm-ZN{y\x)\ , . . , f f min - Zn {y\x) 

+ (^N{y\x)(p\ 



^N{y\x 



^N{y\x) 



If we car ed only abo u t the ECI (without integration (^), the branch and bound algorithm 
given by iJones et al.l (119981 ) would apply leading to a conditional EGO algorithm. 



3.1.1 Choosing /j^in 

Figure [1] illustrates how a deterministic choice of /min can influence the ECI. Consider two 
cases ((a) and (b)), which pertain to the choices for /min introduced in Section \T2\ (rep- 
resented by horizontal lines): (a) uses only the observed locations and (b) uses the whole 
predictive curve. We will return to details of these choices shortly. In the figure, the solid 
parabolic curve represents the predictive mean surface E{Z(-)}. The EI is the area of the 
predictive density drawn as a solid line, plotted vertically and centered at z{y), which lies 
underneath the horizontal line(s), representing choices of /min- The ECI is likewise the area 





- Z{y) ~ h{y) 


: 






- Z{y\x) ~ fN{y\x) 1 

1 1 
1 / 






(a), /min 








(t*)' /min 


\ \ 
\ \ 


\y 


X 




\ \ 
\ \ 







Figure 1: Illustrating how the choice of /min influences the ECI. The solid curve represents 
the mean-predictive E{Z'(-)}. The densities of Z{y) and Z{y\x) are shown as solid and 
dashed "bell-curves", respectively. In (a) /min is taken to be the mean predictive at the 
input locations whereas in (b) it is taken to be the minimum of predictive-mean surface. 
The respective improvements are the areas of the densities underneath /min- 
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of the predictive density drawn as a dashed hne lying below the horizontal line(s). This 
dashed density has the same mean/mode as the solid one, but it is more sharply peaked by 
the influence of x. If we suppose that the densities, drawn as bell-curves in the figure, are 
symmetric (as they are for a GP), then it is clear that the relationship between ECI and EI 
depends upon /min- As the dashed line is more peaked, the left-tail cumulative distributions 
have the property that FN{fmm\x) > F7v(/mm) for all > E{Z{y\x)} = E{Z{y)}, to which 
choice (a) for f^i^ corresponds. Therefore K{I{y\x)} > E{/(y)} in this case, violating our 
desired monotonicity property. But for choice (b) the ECI represents a reduction compared 
to the EI, since /min < E{Z {y\x)} , thus satisfying the monotonicity property. 

Case (a) in Figured] is meant to represent taking /min = min{2;i, . . . ,zn}, deterministi- 
cally. It may similarly represent the minimum of the mean-predictive at the Xn locations, 
which would coincide with the minimum of the Zn values in the no-noise (77 = 0) case. 
In the noisy case {rj > 0) /min in Eq- dS]) is a random variable whose distribution can be 
approximated by simulation from F^- But this extra computational effort would be in 
vain because the monotonicity property is not guaranteed. Case (b) corresponds to taking 
/min = minE{Z(-)}, the minimum of the posterior mean-predictive — another deterministic 
choice. In this case it is clear that /min will always cut through the density of Z{y\x) at or 
below its mean/mode E{Z(i/|x)} = E,{Z{y)} and ensure that the monotonicity property is 
satisfied. Accordingly, we shall use this choice throughout the remainder of the paper. 

3.1.2 A Monte Carlo Approach for Calculating the ECI 

The following procedure may be used to obtain samples of the ECI via the GP surrogate 
posterior predictive /at, taking full account of uncertainty in the parameters 9 = {a'^,d,ri). 
The procedure is borne out via Monte Carlo sampling for 6 in Figure [21 If 6' is considered 



For t = 1, . . . ,T, current design Dj^ = {Xjy, Zj^), repeat: 

1. Sample 6*^*^ from the posterior distribution conditional 
upon /at, priors, and Dn, and possibly conditional upon 

in an MCMC setup 

2. Calculate /min = mm ZN{-\d^^^) 

3. Obtain the t^^^ sample from E{/(?/|x)} as 
EW{/(|/|x)} = E{/(a;|2/)|^W}, following Eq. with 

ZN{y\x,e^''>) and a%{y\x,9^'^) 



Figure 2: Monte Carlo approximation of the ECI statistic. 

known, or has been estimated offline, e.g., via maximum likelihood, then we may skip the 
loop (and Step 1), taking T = 1 with 9^^^ = 9. In either case, an estimate of the ECI is 
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obtained by ergodic averaging: 



E{/(y|x)}^lX^EW{J(y|x)}. (11) 

t=i 



3.2 Integrated Expected Conditional Improvement Algorithm 

Calculating the lECI ([6]) from the ECI requires integrating over y & X according to g{y), 
which may be uniform in a bounded (constraint) region. It will not generally be possible 
to integrate analytically, so we propose to augment the Monte Carlo procedure from Sec- 
tion [21121 Given a large number of sampled reference locations Ym = y^^\ • • • , y*-*^^ ~ g, the 
lECI may be approximated with T Monte Carlo samples from the ECI as follows. 



^ M T 

E,{I{x)} ^ --— 5^ $^EW{J(y(")|x)} (12) 



MT 

m=l t=l 



When the parameters 6 are known, T = 1 as before. With larger M (and T) we obtain an 
improved approximation, and in the limit we have equality. In the case whe re q is uniforrn 



over a convex region, a grid or maximum entropy design may be preferred ( ISantner et al.l 



2nn.i Section 6.2.1) . When the marginals of g are known, a Latin Hypercube Design (LHD, 



Santner et al. . 20031 Section 5.2.2) may be more generally appropriate. 



If we choose (or are required) to work with a size M grid, design, or LHD of reference 
locations y G A", we may view g as discrete and finite measure. An alternate approach in 
this case is to forgo (re-)sampling from g and compute a weighted average instead: 

T M 

E,{/(a:)} ^ E M E EW{/(y(™)|x)}^(y(™)). (13) 

4=1 m=l 

This has the disadvantage that the ECI may be evaluated at many reference locations y^"^^ 
with low (or zero) probability under g. But it has the advantage of an implementation that 
is easily adapted to the unknown constraint situations described shortly. 



3.3 Illustrating lECI 

To illustrate lECI consider the following process E{Z(x)} = /(x) = sin(x) +2.5500.45(3^ — 3), 
observed for x G [0, 7]. As a mixture of a sinusoid and normal density function (with /i = 3 
and a = 0.45) it has two local minima in this region. To make things interesting, realizations 
of the process are observed with i.i.d. noise so that Var{Z(x)} = 0.15^. The top-left panel of 
Figure [3] shows one random realization of this process at LHD inputs. The predictive mean 
and 90% interval obtained by sampling from the posterior under the CP is also shown. A 
visual inspection of the surface (s) reveals that, indeed, there are two local minima. 

Below that panel, on the bottom-left, the EI (solid black) and lECI (dashed red) surfaces 
are plotted, normalized to appear on the same [0, 1] scale. As a further visual aid, the design 
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predictive surface predictive surface 




1 \ \ \ \ \ \ r n I I I \ \ I r 

01234567 01234567 




Figure 3: Comparing EI and lECI. The top panels show the design and posterior predictive 
surface. The bottom panels show EI and lECI statistics for the corresponding surfaces above. 
In the case of constrained optimization, in the right panels, the constraint violation region 
is shown with slashes. 



X]\[ is also shown, and the vertical lines crossing the x-axis intersect with the curves at their 
maxima. We took a uniformly spaced set of 100 candidate locations in [0,7], our X, and 
calculated the EI and lECI at x G Af. Likewise, we took the same M = 100 points as 
reference locations Ym = X for the lECI calculations via Eq. (1121) . EI recommends taking a 
sample from the left local minima, although the relative heights of the two disparate regions 
of highest EI betrays that this decision is indeed a "close call". In contrast, lECI suggests 
taking the next sample from the right local minima, and with much greater decisiveness. 
The lower concentration of samples nearby this right-minima lead to higher variance in that 
region which may be pooled by the more globally-scoped lECI. 

The right-hand panels in Figure [3] show a similar sequence of plots in the presence of 
a known constraint C = [0,2] U [4,7]. To illustrate EI and lECI in this scenario, consider 
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the random realization and corresponding posterior predictive surface in the top-right paneL 
Here the Xj^j design locations all reside inside C. The bottom-right panel shows the EI 
statistic over the entire (discrete) range for x G A', as above. Those parts of the EI curve 
corresponding to inputs which violate the constraint are dotted. The EI is maximized outside 
of the constraint region near x = 2.75, with the maximal value inside C at the x = 4 
boundary. The lECI statistic is also shown over the entire range, but the y^™-* locations are 
restricted to C. I.e., Ym = X H C. This is so that we may consider the extent to which 
every location x & X reduces the average conditional improvement y E C. Observe that the 
maximal lECI point is a; = 3.75. This point gives the greatest reduction in improvement 
averaged over the constraint region, even though it does not, itself, satisfy the constraint. 



4 Dealing with Unknown Constraints 

Here we extend the lECI to unknown constraints. Much of the necessary scaffolding has 
already been built into the lECI via g{y), e.g., g{y) = F{C{y) = 1). It remains for us to flesh 
out the Monte Carlo by incorporating the surrogate c^r for C{y). We extend the parameter 
vector 6 to contain parameters for both surrogates: 6 = {6f,6c)] and the data to include the 
class/constraint labels: Dn = {Xn,Zn,Cn)- Inference for unknown 6\D]\f is via samples 
from the joint posterior. An appropriate choice of cat is discussed in Section 14. 1[ 

For now, overload the generic classification surrogate notation to let CN{y^'^^Oc^) denote 
the probability input y^'^^ satisfies the constraint given parameters Then, 

T M 
f=l m=l 

Note that in E(*){/(y(™) |x)} there is an implicit dependence upon , unless these parame- 
ters are taken as known. In that case we may drop the [t) superscript from the ECI expression 
in Eq. (HM . and re-arrange the order of summation to avoid unnecessarily re-calculating the 
ECI for each t. Observe that Eq. (fT4l) extends Eq. (ITSll rather than (fT2|) . Sampling from the 
surrogate gN, rather than simply evaluating the related quantity cat, would not generally be 
straightforward, and so we prefer to work with design-based candidates y E X. 



4.1 An Appropriate Constraint Surrogate, and Sequential Infer- 
ence 

An appropriate partner to the canonical GP (regression) surrogate /at for / is a classification 
GP (CGP) surrogate cat for c. For details on C GP specification and corresponding Monte 



Carlo inference based on MCMC, see iNeall (119981 ). As in the regression case, the CGP model 



is highly flexible and competitive with, or better than, the most modern models for non- 
parametric classification. However, batch inference methods based on MCMC are at odds 
with the sequential nature of the design strategy. Except to guide the initialization of the 
new Markov chain, it is not clear how fits from earlier iterations may re-used in search of 
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the next design point. So after each sequential design step the MCMC must be re-started 
and iterated until convergence. The result is a slow algorithm. 

So instead of taking the traditional, established, MCMC approac h to C/GP inference 



we fol low a new sequential Monte Carlo (SMC) approach outlined by iGramacy and Poison 



( 120101 ). They show how CP and CGP models can be implemented in an onli ne setting, by 



quick ly updating a discrete approximation to the posterior via particle learning (jCarvalho et al. 



20081 ). This approach leads to fast online — and in some c ases statistically super i or (i.e ., lower 



MC error) — posterior summaries compared to MCMC. IGramacy and PolsonI ( 120101 ) go on 
to describe to how EI for optimization and entropy based boundary exploration for classi- 
fication can proceed efficiently with particles. This is easy to extend to lECI by coupling 
the regression and classification models (/at and cn) via the Monte Carlo approximations 
described earlier in this paper. 

4.2 Illustrations and Examples 

We provide two synthetic data examples where the constraint region is unknown. In both 
cases we take the candidate and reference locations (identically: = X) as a. LHD ran- 
domly genera t ed at t he beginnin g of each round and then au gmented with an oracle point 
( iTaddv et all l2009bh . We follow I Gramacv and Taddvl (I2OIOI ) in taking the oracle point as 



the local maximum obtained via numerical non- derivative minimization initialized at the last 
added design point and searched over the MAP predict ive surface inferred in the p revious 



round. An implementation via particles is described by ( iGramacy and PolsonI . I2OIOI ). 

The objective function and constraint region for the first example was presented in Section 
13.31 We initialize the optimization with a size 20 LHD, and then collect 60 points by lECI 
with 100 fresh candidates/reference locations as described above. Figure H] summarizes the 
results after the 80 total samples were gathered. Observe from the plots in the top row 
that most samples (after the 20 initial ones) were gathered in the two local minima, with a 
few taken outside C. The oracle candidates (solid circles) indicate the most likely locations 
of minima according to the posterior predictive distribution. The bottom panes show an 
estimate of cn via the posterior mean probability of violating the constraint {P{cn{x) = 1)), 
and a progress meter showing the largest (log) expected reduction in average improvement 
([7]) at each round. Observe how the ability to improve upon the current minimum decreases 
over time, giving a heuristic indication of convergence. 

In our second example, the objective function for 2-d inputs x = (xi,X2) is given by 

f{xi,X2) = —w{xi)w{x2), where (15) 
w{x) = exp (— (x — 1)2) + exp (-0.8(a; + 1)^) - 0.05 sin (8(x + 0.1)) 

and observed without noise. The constraint (satisfaction) region is the interior of an ellipse 
defined by the 95% contour of a bivariate normal distribution centered at the origin, with cor- 
relation —0.5 and variance 0.75^. The true global minimum is at (xi, X2) = (—1.408, —1.408), 
which does not satisfy the constraint. There are, however, three other local minima — two 
of which satisfy the constraint. The setup is as described above for the 1-d example except 
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Figure 4: Progress in 1-d optimization after 80 samples: top-left shows the posterior mean 
predictive surface (of /at); top-right shows sampled x- values (open circles) and oracle candi- 
dates (closed) before and after the initial design, as separated by the vertical bar; horizontal 
lines indicate the unknown constraint region; bottom-left posterior mean of constraint (vi- 
olation) surface (cat); bottom-right the maximum of the log expected reduction in average 
improvement (l7l) over time. 



that the optimization is initialized with 25 LHD samples, after which 100 are gathered by 
lECI with 100 fresh candidates in each round. Figure [5] summarizes the results after the 
125 total samples were gathered. Observe that very few samples were gathered outside the 
unknown constraint region, except near the local minima. It is sensible to sample heavily 
on the boundary of the constraint region where the response is quickly changing and local 
minima are likely to occur. This is in case the global minimum is on the boundary, and 
also helps to extract the GP parameters in regions of highest importance. Notice that large 
concentrations of samples occur for the two minima well inside the constraint region. But 
the bottom-right plot indicates that further progress can be made by additional sampling. 
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Figure 5: Progress in 2-d optimization after 125 samples: top-left posterior mean predictive 
surface; top-right: sampled (xi, X2)-values (open) and oracle candidates (closed); bottom-left 
posterior mean of constraint surface; bottom-right: the progress meter ([71). 



5 Health Policy Optimization 

Our motivating example in volves a simu l ation of health care policy in the United States. 
The COMPARE simulator (IGirosi et al.l . l2009l ) was developed at the RAND Corporation 
to predict the effect of various health care policies in terms of individual choices of health 
insurance and the associated costs. It is an agent-based microsimulation model that uses a 
maximum utility approach to predict the health insurance decisions of individuals, families, 
and firms as a function of a wide range of inputs on available types of policies, and on 
taxes, penalties, and regulations. The population is simulated based on Census Bureau 
data. Additional datasets provide values for many of the parameters in the simulation, and 
other parameters are set as part of the possible policy interventions. However, there are 
several calibration parameters that are tuned so that when the simulator is run on current 
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policies, it makes predictions as close as possible to the current observable situation in the 
United States. Such a calibration can be viewed as a minimization problem, choosing the 
values of the calibration parameters to minimize the discrepancy between predictions and 
reality. This setup is comr non for computer simu l ators and has been investigated in the 



unconstrained setting (e.g., Kennedy and O'Hagaru . 1200 ll ). What differs from the standard 



setup here is the presence of unknown constraints. 

The simulator has a number of inputs and outputs, here we focus on a subset deemed 
most important by our collaborators, the designers of the simulator. The inputs over which 
we optimize are a set of six calibration parameters: utility tuning parameters for adults 
on ESI programs, adults on individual programs, and adults on public programs, and an 
analogous set of three parameters for children. The outputs of interest are the predicted 
counts in each type of insurance (or the uninsured category) and the elasticities of response 
for the key categories of adults in individual plans, adults in restricted individual plans, 
uninsured adults, children in individual plans, children in restricted individual plans, and 
uninsured children. The objective function specified by our collaborators is a combination 
of the absolute errors in the predicted counts and the squares of the predicted elasticities: 

4 4 4 

Z (x) = ai \yaj - Vaj I + «2 IV'^j - V'^j I + «3fcl/efcl{|y,,|>l} 

j=l i=l k=l 

where cti, 02, and a^k are constants specified by our collaborators that weight the pieces 
appropriately. Our goal is to minimize this objective function under the constraint that the 
elasticities for the insured are negative and the elasticities for the uninsured are positive. 
The elasticities can only be found by running the simulator, so this set of constraints fits 
under our unknown constraints regime. 
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Figure 6: Slices of the fitted response surface; dark shades are lower values. 



Figure [6] shows pairwise slices of the fitted response surface. The left panel shows how the 
fitted predicted surface varies as a function of the parameters for adult and child ESI, when 
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the other four parameters are held fixed at a value around that which produces the minimum 
response. The middle and right panels vary by the parameters for individual programs and 
public programs respectively. Dark shades are lower values, so it can be seen that both ESI 
parameters need to be relatively high, the child individual parameter needs to be low, and 
the other three parameters are relatively less important. The points plotted in the figure are 
the 550 total inputs sampled projected into the each of the three pairs of input coordinates. 

prob. of constraint violation prob. of constraint violation prob. of constraint violation 
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Figure 7: Slices of the fitted probability of constraint violation; dark shades are lower values; 
sampled points violating the constraint are shown with asterisks. 

Figure [7] shows the fitted probability of a constraint violation over the portions of the 
space which were routinely sampled. As seen in Figure El some regions are not well-sampled 
because they do not help in finding the minimum, the goal of the problem. These sparsely 
sampled regions do not provide much information for estimating the probability of a con- 
straint violation (which is not the primary goal of the problem), and so the estimated values 
are overly infiuenced by the prior mean. Thus we only display parts of the regions in the 
first two plots to better show the estimated probabilities. Sampled points which violated 
the constraints are shown with asterisks. One can see that the largest probabilities of con- 
straint violations occurred for large values of the ESI parameter, for jointly small values of 
the individual and child individual parameters, and for values of the public and child public 
parameters which are in the corners of the space. 

Figure M shows the progress meter (171) over the 500 optimization rounds which can be 
used as a heuristic check of convergence. As in previous examples, the noisiness in the meter 
is due to the LHD predictive grid of 100 candidates at which the lECI is evaluated in each 
round. After about 250 samples the lECI seems to have "bottomed-out" . However, further 
progress can be made to reduce the frequency and magnitude of the "up-spikes" in the 
remaining optimization rounds, and thereby obtain higher confidence that the constrained 
global minimum has been obtained. 
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Figure 8: Progress meter ([7]) for the health pohcy optimization. 



6 Discussion 



We have introduced a statistical approach to optimization under unknown constraints by an 
integrated conditional expected improvement (lECl) statistic. The idea is to consider how 
the improvement at reference locations (y) conditional on candidates (x) may be used to 
augment a design. Without considering constraints, the resulting statistic is a less greedy — 
aggregated — version of the standard expected improvement (El) sta tistic. Another way t o 



obtain a less greedy El is to raise the improvement to a power g ( ISchonlau et al.l . Il998l ). 



The lECl approach, by contrast, does not require such a tuning parameter. In the presence 
of unknown constraints, lECl allows us to coherently consider how design candidates adjust 
the improvement at reference locations believed to satisfy the constraint. Our method was 
illustrated on two synthetic examples and a motivating problem from health care policy. An 
implementation is provided in the plgp package on GRAN. 

We envisage many ways that our methodology may be extended and improved. Under- 
standing of convergence of statistical optimization algorithms is scant at best, and lECl is 
no exception. While we provide a sensible heuristic that seems to work well in our examples, 
much remains to be done in this area. It may also be sensible to model the constraint as a 
function of the inputs (x) and the real-valued response {Z{x)). An example of where this 
would be handy is when C = {x : Z{x) < k}, for some constant c. Our dual-GP modeling 
framework may easily be extended to allow uncertainty in Z (real- valued) responses to filter 
through, as predictors, into the surrogate model for th e classifica. t ion la bels. A more difficult 
extension involves accommodating hidden constraints ( Lee et all . 20101 ) 
the real- valued response fails, e.g., due to a lack of convergence in a simulation 
may be worthwhile to consider surrogate models beyond GPs. Dynamic trees for regression 



where evaluation of 
Finally, it 
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and classification show considerable promise (ITaddy et al, 



20101)- 
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